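360$^\circ$ video saliency detection is one of the challenging benchmarks for 360$^\circ$ video understanding, because non-negligible distortion and discontinuity occur in the projection of any format of 360$^\circ$ video, and the capture-worthy viewpoint in the omnidirectional sphere is ambiguous by nature. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using a Vision Transformer with deformable convolution, which enables us not only to plug pretrained models from normal videos into our architecture without additional modules or fine-tuning, but also to perform the geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn saliency from three simple relative relations among local patch features, outperforming state-of-the-art models on the Wild360 benchmark by large margins without supervision or auxiliary information such as class activation. We demonstrate the utility of our saliency prediction model on the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision, including head movement.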
translated by 谷歌翻译
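Recent studies have determined that the learned token embeddings of large-scale neural language models degenerate into an anisotropic distribution with a narrow-cone shape. This phenomenon, called the representation degeneration problem, increases the overall similarity between token embeddings, which negatively affects model performance. Although existing methods that address the degeneration problem based on observations of the phenomena it triggers improve text generation performance, the training dynamics of token embeddings behind the degeneration problem remain unexplored. In this study, we analyze the training dynamics of token embeddings, focusing on rare token embeddings. We demonstrate that a specific part of the gradient for rare token embeddings is the key cause of the degeneration problem for all tokens during the training stage. Based on this analysis, we propose a novel method called adaptive gradient gating (AGG). AGG addresses the degeneration problem by gating the specific part of the gradient for rare token embeddings. Experimental results on language modeling, word similarity, and machine translation tasks quantitatively and qualitatively verify the effectiveness of AGG.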
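The "overall similarity between token embeddings" mentioned in the abstract above can be made concrete with a small diagnostic. The following is a minimal sketch, not the paper's evaluation code: it assumes the mean pairwise cosine similarity as the measure, and the function names and toy 2-D vectors are hypothetical, chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy data (assumed for illustration): directions spread around the origin
# versus directions squeezed into a narrow cone around the first axis.
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
narrow_cone = [[1.0, 0.1], [1.0, -0.1], [1.0, 0.05], [1.0, 0.0]]

print(mean_pairwise_cosine(spread_out))   # negative for this toy set
print(mean_pairwise_cosine(narrow_cone))  # close to 1
```

A value near 1 indicates that the embeddings point in almost the same direction, which is the narrow-cone situation the abstract describes.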
Vision Transformers (ViTs) have become a dominant paradigm for visual representation learning with self-attention operators. Although these operators provide flexibility to the model with their adjustable attention kernels, they suffer from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy of the ViT layers, and (2) the complexity in computation and memory is quadratic in the sequence length. In this paper, we propose a novel attention operator, called lightweight structure-aware attention (LiSA), which has a better representation power with log-linear complexity. Our operator learns structural patterns by using a set of relative position embeddings (RPEs). To achieve log-linear complexity, the RPEs are approximated with fast Fourier transforms. Our experiments and ablation studies demonstrate that ViTs based on the proposed operator outperform self-attention and other existing operators, achieving state-of-the-art results on ImageNet, and competitive results on other visual understanding benchmarks such as COCO and Something-Something-V2. The source code of our approach will be released online.
3D-aware image synthesis focuses on preserving spatial consistency in addition to generating high-resolution images with fine details. Recently, Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate generative NeRFs and show remarkable results, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated on three image datasets: AFHQ, CelebA, and Cars. Our model shows strong 3D consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF achieves a Fr\'echet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis at a resolution of $128^{2}$. Additionally, we report FIDs of the generated 3D-aware images for each class of the datasets, since $\text{C}^{3}$G-NeRF makes it possible to synthesize class-conditional images.
In both terrestrial and marine ecology, physical tagging is a frequently used method to study population dynamics and behavior. However, such tagging techniques are increasingly being replaced by individual re-identification using image analysis. This paper introduces a contrastive learning-based model for identifying individuals. The model uses the first parts of the Inception v3 network, supported by a projection head, and we use contrastive learning to find similar or dissimilar image pairs from a collection of uniform photographs. We apply this technique to corkwing wrasse, Symphodus melops, an ecologically and commercially important fish species. Photos are taken during repeated catches of the same individuals from a wild population, where the intervals between individual sightings might range from a few days to several years. On our dataset, the model achieves a one-shot accuracy of 0.35, a 5-shot accuracy of 0.56, and a 100-shot accuracy of 0.88.
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
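The greedy policy described above, which queries the feature with the largest conditional mutual information given what has been observed so far, can be illustrated on a toy problem where the data distribution is known exactly. This is a minimal sketch of the oracle version only, not the amortized learning approach the abstract proposes; the joint distribution, function names, and argument layout are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def conditional_mutual_information(joint, candidate, observed):
    """I(y; x_candidate | observed feature values), in nats.

    joint maps outcomes (x_0, ..., x_{d-1}, y) to probabilities;
    observed maps feature index -> observed value.
    """
    p_xy = defaultdict(float)
    total = 0.0
    for outcome, p in joint.items():
        # Keep only outcomes consistent with the observations so far.
        if all(outcome[i] == v for i, v in observed.items()):
            p_xy[(outcome[candidate], outcome[-1])] += p
            total += p
    if total == 0.0:
        raise ValueError("observed values have zero probability")
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p / total
        p_y[y] += p / total
    mi = 0.0
    for (x, y), p in p_xy.items():
        q = p / total
        if q > 0.0:
            mi += q * math.log(q / (p_x[x] * p_y[y]))
    return mi


def greedy_select(joint, num_features, sample, budget):
    """Query `budget` features of `sample`, one at a time, by greedy CMI."""
    observed, order = {}, []
    for _ in range(budget):
        remaining = [i for i in range(num_features) if i not in observed]
        best = max(
            remaining,
            key=lambda i: conditional_mutual_information(joint, i, observed),
        )
        observed[best] = sample[best]  # reveal the queried feature's value
        order.append(best)
    return order


# Toy distribution (assumed): three fair bits; the label encodes x0 together
# with x1 when x0 == 0, and x0 together with x2 when x0 == 1.
joint = {}
for x0, x1, x2 in product([0, 1], repeat=3):
    y = 2 * x0 + (x1 if x0 == 0 else x2)
    joint[(x0, x1, x2, y)] = 1 / 8

print(greedy_select(joint, 3, sample=(0, 1, 0), budget=2))  # [0, 1]
print(greedy_select(joint, 3, sample=(1, 0, 1), budget=2))  # [0, 2]
```

The second query differs between the two samples because it depends on the value observed for the first feature, which is what distinguishes dynamic selection from a static feature subset.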